Naive Bayes
Introduction
Imagine you have just subscribed to an online movie streaming service, such as HBO or Netflix. Some weeks after your subscription, you have watched 20 movies on this platform, out of which you liked 15 and disliked the other 5 (in practice, there will be movies about which you express no opinion, but let’s assume this is not the case for now). If no other information is available about you, the service could argue that the probability that you like a movie would be 75% (15 / 20). However, what if out of the 15 movies that you liked, 14 of them were suggested by a close friend of yours, with whom it happens that you share similar interests? What if your friend discouraged you from watching the other 5 movies, out of which you liked only one? It is not difficult to understand that, by taking more information into account (e.g., your friend’s suggestion), the estimated probability of you liking the streaming service movies changes. According to the data, if you were to watch a movie that your friend suggested, we would estimate that the probability that you will actually like it would arguably be higher than 75%.
In this chapter, we attempt a short primer on probability essentials as a precursor to one of the most useful machine learning methods for classification, called Naive Bayes. This method is based on Bayes’ Theorem, named after Thomas Bayes, which is the foundation of conditional probability analysis. The term “naive” comes from the fact that we make a very “naive” assumption, which in reality does not hold in most applications (Lantz, 2023). Nonetheless, this should not discourage us from using this method in practice; we discuss more details about this assumption later in this chapter.
Visiting Probability Theory
Let us revisit the above case of the online movie platform subscription. Whether you like or dislike a movie or TV series can be thought of as an event with only two possible outcomes: you either like it, or you do not. In the absence of any additional information, the probability that you actually like any given movie or TV series is represented as:
\[P(\text{Like})\]
In our example, our starting point was that, out of 20 movies, we liked 15. Accordingly, the probability that we like any given movie is estimated as \(P(\text{Like}) = 75\%\). Of course, if \(P(\text{Like}) = 75\%\), then \(P(\text{Dislike}) = 25\%\) (remember, the sum of the probabilities of all possible outcomes of an event should always equal 100%). In probability theory, we say that these two outcomes (liking a movie and disliking a movie) are mutually exclusive and collectively exhaustive (MECE).
But what if we had additional information? As mentioned in our earlier example, it makes sense to think that the probability that you like a movie on the online platform increases if that movie is recommended by your friend. Things start to become more complicated when we want to calculate the probability of an event based on the probability of (many) other events.
Would the probability of you liking a movie still increase if another event were to happen, such as you flipping a coin and getting heads? In probability theory, it is always important to know whether two events are independent or dependent. We say that two events are independent when they are unrelated to each other, that is, the outcome of one does not affect the outcome of the other. For instance, the probability that you like a particular movie is independent of getting heads (or tails, for that matter) when flipping a coin. The probability that you like a movie is still \(P(\text{Like}) = 75\%\) and the probability that you get heads (assuming that the coin is fair) is \(P(\text{Heads}) = 50\%\). Because these two events are independent, we can calculate the joint probability that you both like the movie and get heads simply by multiplying the probability of one outcome by the probability of the other. Mathematically, we have:
\[P(\text{Like} \cap \text{Heads}) = P(\text{Like}) \times P(\text{Heads}) = 75\% \times 50\% = 37.5\%\]
Thus, the probability for these two outcomes happening simultaneously is 37.5%. Once again, we calculate the probability using the multiplication of individual probabilities because the underlying events are independent.
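As a quick check of this arithmetic in R (the values are taken directly from the running example):

```r
# Probabilities taken from the running example
p_like  <- 15 / 20   # P(Like)  = 0.75
p_heads <- 1 / 2     # P(Heads) = 0.50

# Joint probability of independent events: simply multiply
p_like * p_heads
# [1] 0.375
```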
At this point, it makes sense to ask what changes if two events are dependent. By definition, events are dependent if the outcome of one event is related to that of the other. In those cases, the probability of an event can be calculated using Bayes’ Theorem, which describes the probability of an event based on prior knowledge that could be related to the occurrence of that event. In our example, the prior knowledge is your friend’s suggestion. Mathematically, we can represent this probability in the following way:
\[P(\text{Like} | \text{FriendSuggests})\]
The vertical bar “|” is read as “given”. This mathematical expression is also known as a conditional probability and, in the context of this example, it describes the probability that you like a movie given that your friend is suggesting it (or has suggested it) for you to view. If your friend does not suggest the movie, then we would have \(P(\text{Like} | \text{FriendNOTSuggests})\). The other two possible outcomes can be expressed as: \(P(\text{Dislike} | \text{FriendSuggests})\) and \(P(\text{Dislike} | \text{FriendNOTSuggests})\). To understand how we could calculate \(P(\text{Like} | \text{FriendSuggests})\), we need to have a look at the following frequency table:

|                   | Like | Dislike | Total |
|---|---|---|---|
| FriendSuggests    | 14   | 1       | 15    |
| FriendNOTSuggests | 1    | 4       | 5     |
| Total             | 15   | 5       | 20    |

As we discussed earlier, out of the 15 movies that you liked, 14 of them were suggested by your friend, and out of the 5 movies that you did not like, only one was suggested by your friend.
Now, if we know that your friend suggested a movie, we need to look at the first row of this table: we can see that out of the 15 movies that your friend suggested, you liked 14 of them. Therefore, we get:
\[P(\text{Like} | \text{FriendSuggests}) = \frac{14}{15} = 93\%\]
This probability is much higher, and we can estimate it because your friend’s suggestions carry much more information about whether you would like a particular movie or not. We could calculate the same probability with Bayes’ formula. In our example, this would be the following:
\[P(\text{Like} | \text{FriendSuggests}) = \frac{P(\text{FriendSuggests} | \text{Like}) \times P(\text{Like})}{P(\text{FriendSuggests})} = \frac{P(\text{Like} \cap \text{FriendSuggests})}{P(\text{FriendSuggests})}\]
An important note is that the joint probability \(P(\text{Like} \cap \text{FriendSuggests})\) is equal to \(P(\text{FriendSuggests} | \text{Like}) \times P(\text{Like})\), not to \(P(\text{Like}) \times P(\text{FriendSuggests})\), because the events are dependent. Regarding the calculation, we have:
\[P(\text{Like} | \text{FriendSuggests}) = \frac{P(\text{FriendSuggests} | \text{Like}) \times P(\text{Like})}{P(\text{FriendSuggests})} = \frac{\frac{14}{15} \times \frac{15}{20}}{\frac{15}{20}} = 93\%\]
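We can reproduce both routes to this probability in R, using the counts from the frequency table:

```r
# Counts from the frequency table
liked_and_suggested <- 14
suggested_total     <- 15   # movies your friend suggested
liked_total         <- 15   # movies you liked
movies_total        <- 20

# Direct estimate from the table: P(Like | FriendSuggests)
p_direct <- liked_and_suggested / suggested_total

# The same probability via Bayes' formula
p_fs_given_like <- liked_and_suggested / liked_total   # P(FriendSuggests | Like)
p_like          <- liked_total / movies_total          # P(Like)
p_fs            <- suggested_total / movies_total      # P(FriendSuggests)
p_bayes         <- p_fs_given_like * p_like / p_fs

c(direct = p_direct, bayes = p_bayes)
# both equal 14/15, i.e. about 0.933
```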
Before moving on to see how to apply the Naive Bayes method in R, it is important to clarify how we can tell whether two events are independent or dependent. This is an important clarification because, as we will discuss later, the main assumption of Naive Bayes depends on this characteristic.
When two events \(A\) and \(B\) are independent, their joint probability is simply \(P(A \cap B) = P(A) \times P(B)\). Since we just multiply the probabilities of the two events, it makes no difference if, instead, we write \(P(B \cap A) = P(B) \times P(A)\); the outcome would be exactly the same. If, instead, the probability of one event influences the probability of the other, then:
- \(P(A \cap B) = P(B|A) \times P(A)\)
- \(P(B \cap A) = P(A|B) \times P(B)\)
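Using the numbers from our example, we can verify in R that the joint probability of these dependent events differs from the product of their individual probabilities:

```r
# Joint probability of the dependent events in our example:
# P(Like and FriendSuggests) = P(FriendSuggests | Like) * P(Like)
p_fs_given_like <- 14 / 15
p_like          <- 15 / 20
p_joint <- p_fs_given_like * p_like
p_joint
# [1] 0.7

# Multiplying the marginal probabilities, as we would for
# independent events, gives a different (wrong) answer here
p_fs <- 15 / 20
p_like * p_fs
# [1] 0.5625
```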
Using our previous example, we can imagine that the probability that you like a movie, given that your friend recommends it, is different from the probability that your friend recommends a movie, given that you like it. Thus, not only is the joint probability different when we have dependent events, but the order of the events (which event influences which) also matters.
Naive Bayes Method
Bayes’ formula is the basis of the Naive Bayes method in machine learning. Essentially, we can create a model that can be used for prediction, starting from pre-existing (or pre-measured) independent variables. In our movie suggestion example, we saw mathematically how we could incorporate the information about your friend’s suggestion to make far more accurate estimations. Without the information about your friend’s suggestion, we saw that the probability you would like a movie would be 75%.
However, things start to become complicated as we include more and more information. For instance, what if we have suggestions from 10 friends instead of just 1? In that case, we would have 10 different events and we would need to calculate all the corresponding joint probabilities to apply Bayes’ formula. In other words, it would be too difficult and computationally expensive to apply Bayes’ formula to calculate all these conditional probabilities. Instead, the Naive Bayes algorithm makes a very strong, yet “naive”, assumption about the relationships among the events: that all events are independent. We saw earlier that when events are independent, we simply multiply their probabilities, without further consideration; indeed, this is how the Naive Bayes algorithm estimates the probability of an observation of interest.
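To see this “naive” multiplication in action, here is a minimal sketch with made-up conditional probabilities for three friends (the numbers are purely illustrative, not taken from our dataset):

```r
# Made-up conditional probabilities that each of three friends suggests
# a movie, given that you like it (or dislike it); under the naive
# assumption, the suggestions are treated as independent given the class
p_suggest_given_like    <- c(0.9, 0.8, 0.7)
p_suggest_given_dislike <- c(0.2, 0.3, 0.4)
p_like    <- 0.75
p_dislike <- 0.25

# Naive Bayes score per class: prior times the product of conditionals
score_like    <- p_like    * prod(p_suggest_given_like)
score_dislike <- p_dislike * prod(p_suggest_given_dislike)

# Normalizing the scores gives the estimated probability of "Like"
score_like / (score_like + score_dislike)
# [1] 0.984375
```

If all three friends suggest a movie, the model is almost certain we will like it, even though no joint probabilities were ever computed.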
We need to keep in mind that there might be combinations of events for which the joint probability is, in fact, zero. For instance, let’s imagine that there was a movie that we decided not to watch even though it was recommended by all of our 10 friends. Since we did not watch that movie, we cannot say whether we liked or disliked it. In such cases, the numerator of Bayes’ formula would be equal to zero. This is a big problem, as we want to include sufficient information from past events in order to improve our estimates. To solve it, we use what is known as the Laplace estimator. The Laplace estimator simply adds a small number (we typically set it equal to 1, which is small compared to the observed frequencies) to each possible combination of events (that is, to each cell in a frequency table). With this trick, we ensure that the estimated probability of any given combination is not zero, yet remains very small.
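A small sketch of how the Laplace estimator works, with hypothetical counts:

```r
# Hypothetical counts: a feature value that was never observed
# together with the class "Like" in the training data
counts <- c(Like = 0, Dislike = 3)

# Without smoothing, P(value | Like) = 0, which wipes out the
# whole product of probabilities for that class
counts / sum(counts)

# Laplace estimator: add 1 to every cell before estimating
laplace <- 1
(counts + laplace) / sum(counts + laplace)
# "Like" now gets a small non-zero probability (0.2) instead of exactly 0
```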
Although the Naive Bayes method uses only nominal or categorical independent variables to solve classification problems (the dependent variable is also nominal or categorical), we can indirectly use continuous variables as predictors as well. The way to do that is to transform our numeric variables into categorical ones. We may accomplish this by visualizing the distribution of a numeric variable and creating different bins “by hand”, by using purpose-built machine learning algorithms, or by mapping the categories based on scientific research or practical implications.
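For instance, a hypothetical numeric spending variable could be binned “by hand” with base R’s cut() function (the breakpoints here are arbitrary, chosen only for illustration):

```r
# Hypothetical numeric values to be turned into categories
monetary_value <- c(12, 85, 40, 250, 7, 130)

# Manually chosen breakpoints define the bins (0, 50], (50, 150], (150, Inf)
monetary_level <- cut(monetary_value,
                      breaks = c(0, 50, 150, Inf),
                      labels = c("Low", "Medium", "High"))
as.character(monetary_level)
# [1] "Low"    "Medium" "Low"    "High"   "Low"    "Medium"
```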
Assumptions
Naive Bayes relies mainly on two assumptions. First, we discussed how event independence is at the core of this machine learning method; second, we assume that all events considered are equally important (we do not prioritize one event over another). Let us explicitly state these assumptions below:
- Conditional Independence: This is the “naive” part of Naive Bayes. It assumes that the presence (or absence) of a particular feature is independent of the presence (or absence) of any other feature.
- Feature Relevance: All features are relevant and equally important.
In our movie recommendation example, we assume that the recommendation of one friend is independent from the recommendation of all other friends (one’s opinion does not affect and is not affected by the opinion of the others), and that all recommendations are equally important (we do not value any one recommendation more than any other).
Naive Bayes in R
Let us explore how to use the Naive Bayes method with the customer_churn dataset, available on GitHub. We will use the categorical variables Recency_Level, Frequency_Level and Monetary_Value_Level to predict whether a customer churns or not. As we need our target variable to be a factor, we recode the Churn column to the value "Churn" if the original value is 1 and "Not Churn" if it is 0.
To predict whether a customer churns or not using Naive Bayes, we use the package naivebayes, which we load along with the readr and dplyr packages. In the code below, we load these packages, import the data, recode the Churn column as described, and select the relevant columns:
```r
# Libraries
library(readr)
library(dplyr)
library(naivebayes)

# Importing customer_churn
customer_churn <- read_csv("https://raw.githubusercontent.com/DataKortex/Data-Sets/refs/heads/main/customer_churn.csv")

# Preparing and selecting target variable and features
customer_churn <- customer_churn %>%
  mutate(Churn = as.factor(if_else(Churn == 1, "Churn", "Not Churn"))) %>%
  select(Recency_Level, Frequency_Level, Monetary_Value_Level, Churn)

# Printing the first 10 rows
head(customer_churn, n = 10)
```

```
# A tibble: 10 × 4
   Recency_Level Frequency_Level Monetary_Value_Level Churn
   <chr>         <chr>           <chr>                <fct>
 1 Low           Low             Medium               Not Churn
 2 Low           Medium          High                 Not Churn
 3 Low           High            High                 Not Churn
 4 Low           Low             Medium               Not Churn
 5 Medium        Low             Low                  Not Churn
 6 Medium        Low             Medium               Not Churn
 7 Low           Low             Medium               Not Churn
 8 High          Low             High                 Churn
 9 Medium        Low             Medium               Not Churn
10 High          Low             Medium               Churn
```
To keep things simple, we split the dataset into a training and a test set, with 75% (or 3,000 rows) and 25% (or 1,000 rows) of the observations respectively. Importantly, this is not a methodological mistake since the order of rows in the dataset is random and does not carry any inherent meaning:
```r
# Training and test sets
training_set <- customer_churn %>% slice(1:3000)
test_set <- customer_churn %>% slice(3001:4000)
```

We can now create a predictive model based on the Naive Bayes method with the function naive_bayes(). With this function, we need to specify the target variable (\(y\)) and the predictors (\(x\)) in the following form:
\[y \sim x\]
Additionally, we set the data argument to the training_set object and the laplace argument to 1, to make sure that no estimated conditional probability is equal to 0. We therefore have the following code:
```r
# Training Naive Bayes classifier
naive_bayes_classifier <-
  naive_bayes(Churn ~
                Recency_Level +
                Frequency_Level +
                Monetary_Value_Level,
              data = training_set,
              laplace = 1
  )
```

The resulting object naive_bayes_classifier is our machine learning model, which we can use to make predictions on the test set. First, however, it is interesting to check the results on the training set. To make predictions with our model, we use the predict() function on the training set and store the values in an object (in this example, we name this object p_training):
```r
# Predictions on the training set
p_training <- predict(naive_bayes_classifier, training_set)
```

We can immediately check the accuracy of our results like this:
```r
# Accuracy on the training set
mean(p_training == training_set$Churn)
```

```
[1] 0.8503333
```
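Note that the mean() of a logical comparison is just the share of matching predictions. A toy example with made-up labels (not the real predictions) makes this clear:

```r
# Toy predicted and actual labels (illustrative only)
predicted <- c("Churn", "Not Churn", "Churn", "Churn")
actual    <- c("Churn", "Not Churn", "Not Churn", "Churn")

# predicted == actual is a logical vector; its mean is the accuracy
mean(predicted == actual)
# [1] 0.75
```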
Although the accuracy does not seem very impressive, given that we evaluated our classifier on the same data it was trained on, let’s check the accuracy on the test set:
```r
# Predictions on the test set
p_test <- predict(naive_bayes_classifier, test_set)

# Accuracy on the test set
mean(p_test == test_set$Churn)
```

```
[1] 0.867
```
The accuracy is slightly higher in the test set than in the training set. This shows that the Naive Bayes method can have low variance across different samples, which means that we are avoiding overfitting our model (see Chapter Introduction to Machine Learning).
Instead of class labels, we can obtain the estimated probabilities by setting the argument type in the predict() function to "prob":
```r
# Predictions on the test set - Probabilities
p_test <- predict(naive_bayes_classifier, test_set, type = "prob")

# Prediction Output
head(p_test)
```

```
           Churn Not Churn
[1,] 0.034576754 0.9654232
[2,] 0.143694665 0.8563053
[3,] 0.154079889 0.8459201
[4,] 0.034576754 0.9654232
[5,] 0.001876934 0.9981231
[6,] 0.001821598 0.9981784
```
The output in this case is a two-column matrix, showing the estimated probability of each class for every observation. We can therefore report estimated probabilities rather than a predicted class.
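If needed, such a probability matrix can be mapped back to class labels with a threshold. A small sketch on a toy matrix of the same shape (the values are made up):

```r
# Toy probability matrix shaped like the predict() output above
p_toy <- matrix(c(0.03, 0.97,
                  0.86, 0.14),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("Churn", "Not Churn")))

# A 0.5 threshold on the "Churn" column recovers class labels
ifelse(p_toy[, "Churn"] > 0.5, "Churn", "Not Churn")
# [1] "Not Churn" "Churn"
```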
Although this case study might appear simple, its purpose was to demonstrate how we can use the Naive Bayes method to create a predictive model. In addition, we observed that the prediction accuracy was approximately the same in the training and the test set, an observation that suggests that Naive Bayes performs well on “unseen” data.
Advantages and Limitations
Naive Bayes is a highly efficient and straightforward probabilistic method, which makes it appealing for many practical applications. It requires relatively little training data to estimate its parameters and can handle both binary and multi-class classification tasks. Additionally, due to its simplicity, Naive Bayes can perform extremely fast calculations on large datasets, often achieving surprisingly good accuracy. Even though Naive Bayes may seem too simple and “naive”, given its rather strict assumptions, it is a machine learning method that has repeatedly been shown to provide solid results. This is the case even when there are significant dependencies among the events. Why this method generally performs so well in classification problems is actually a topic of research interest (Lantz, 2023; McCallum & Nigam, 1998). One possible explanation is that the method provides estimates that are just good enough to classify the observations of interest correctly. In other words, if an unknown observation is classified correctly, it does not matter whether the estimated probability was slightly above 50% or close to 100%.
Despite its simplicity and paradoxical performance, Naive Bayes has several limitations. First, it can only be used for classification tasks, not regression. Second, if features are highly correlated—that is, not independent—the method can struggle, since this violates its main assumption (Lantz, 2023; Zhang, 2004). Third, it cannot easily handle continuous variables. In our study, this was the reason we used categorical features instead of numeric ones. Although continuous variables can sometimes be transformed into categories, this reflects a limitation of the method, especially when most features in practice are continuous.
When applied to datasets with mostly independent features, Naive Bayes can be highly effective, providing a strong baseline performance in tasks such as text classification for spam email detection, credit risk assessment, real-time recommendation systems, and others.
Recap
Naive Bayes is a simple and efficient probabilistic method for classification tasks. It estimates the probability of a class for a given observation by applying Bayes’ Theorem, assuming that all features are conditionally independent and equally important. This “naive” assumption allows the method to compute predictions quickly and handle high-dimensional data, making it especially useful for classification tasks. Despite its simplicity and constraints, Naive Bayes can provide reliable baseline predictions, especially when features are largely independent.